Wednesday, October 05, 2011
Video of SciPy 2011 talk
At the end of the talk, Andy Terrell asked a question about why this structured approach is necessary - why not just program the steps directly in a script?
I agree that the system I presented does seem very 'heavyweight'. But my feeling is that performing the transformations directly loses meta-information that would be useful to keep around, such as which transformation was applied at each step. Additionally, by storing the pieces of the derivation in data structures, they become easier to manipulate (and it emphasizes that they can and should be manipulated programmatically). That said, Python's meta-tools are quite powerful, and could probably be used to manipulate derivations written directly in a script.
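To make that concrete, a derivation could be stored as something as simple as an ordered list of step records. This is a hypothetical sketch (the `Step` class is made up for illustration, and uses today's Python for brevity), not the actual implementation:

```python
from dataclasses import dataclass
import sympy as sp

@dataclass
class Step:
    """One step of a derivation: what was done, and the resulting expression."""
    transformation: str   # meta-information: which transformation was applied
    expr: sp.Expr         # the expression after applying it

x = sp.Symbol('x')
derivation = [
    Step('starting expression', sp.cos(x)**2 - sp.sin(x)**2),
    Step('apply double-angle identity', sp.cos(2*x)),
]

# The meta-information survives, so the chain can be inspected or replayed later.
for step in derivation:
    print(f'{step.transformation}: {step.expr}')
```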
Switching to a different issue, the presented example effectively inlines all the functions into the integral. This is clearly not scalable in the face of larger and more complex expressions. The system needs some way to represent pieces that should transform into a function call in the generated code. I have something working now, but it needs some clean-up work.
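Roughly what I mean, as a minimal SymPy sketch (the name `V` and the Lennard-Jones implementation are just for illustration): the potential stays an opaque function application in the expression tree, and only the final numerical stage binds it to an actual callable.

```python
import sympy as sp

x = sp.Symbol('x')
V = sp.Function('V')        # opaque: no expression is inlined for V
integrand = sp.exp(-V(x))   # the derivation manipulates V(x) symbolically

# Only at evaluation/generation time is V bound to a real function, so the
# resulting code contains a call to V rather than its inlined body.
lj = lambda r: 4.0 * ((1.0/r)**12 - (1.0/r)**6)
f = sp.lambdify(x, integrand, modules=[{'V': lj}, 'math'])
print(f(1.5))               # evaluates exp(-V(1.5))
```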
Tuesday, June 07, 2011
Modeling derivations: Questions and Answers
- What is the target application? Goals?
The target goals are better enabling algorithm research and development (that is, writing a scientific code), and a flexibility to target different runtime systems.
Looking at the examples and case studies on various computer algebra system (CAS) websites, I see many cases where the user modeled some system, solved it, and then used the solution to improve or fix the system. The hard part seems to be modeling the system - this is not my target. I'm interested in problems where the second step is the hard part, and one needs to research new and better solvers.

I'm thinking in terms of a traditional scientific code, and how to automate parts of the development workflow. One big issue is how to use a CAS to assist in the development. I'm also interested in problems that require high performance computing (HPC) - basically, problems that will consume as much computing power and algorithm research as possible. This likely means the code will need to be tuned and altered to support different target systems (multicore workstations, distributed-memory parallel clusters, accelerators like GPUs, etc.).
- Why not use a computer algebra system directly? Aren't you creating a lot of extra work for yourself?
I would like a separation between the specification of an algorithm (for scientific computation, this is largely expressed in equations) and its implementation in the runtime system (along with an automated transformation from the spec to the final system).
Typically a CAS is associated with a full-featured programming language, but implementing the algorithm in that language doesn't preserve this separation.
As a concrete example, a CAS typically includes facilities for performing numerical integration. What if I want to develop a new numerical integration method - are there any examples on how a CAS can be used to do that?
Another issue with implementing HPC-style problems in a CAS is what happens when the existing solvers in the CAS are too slow, or the CAS system language is too slow (e.g., how to scale to large parallel systems). The CAS may do a great deal of work for you, but what happens when you want to move past it? This is where code generation comes in.
- What is the role of code generation in this system?
More generally, the 'code generation' step is the 'transformation to the target runtime system' step. The transformation may be trivial, if the final equations can be solved by the CAS facilities, but I'm anticipating the need to ultimately target C/C++ or Fortran.
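As a sketch of what that last step can look like, SymPy's codegen utility will emit a C source/header pair from a finished expression (the expression here is just a placeholder for the final, fully specialized result):

```python
import sympy as sp
from sympy.utilities.codegen import codegen

x = sp.Symbol('x')
expr = sp.exp(-x**2)   # stand-in for the fully specialized expression

# codegen returns (filename, contents) pairs for the source and the header
[(c_name, c_code), (h_name, h_header)] = codegen(('f', expr), 'C')
print(c_code)
```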
Also, it seems useful to introduce this method into an existing code in small pieces (converting an entire code would usually be too dramatic a change to be successful). The transition would be smoother if the generated code fit the style of the existing code. A template system for generation would be a useful approach here. A text-oriented system (or a basic token-oriented one, like the C preprocessor) would work, but such systems have problems. A system oriented around an actual parse tree of the target language would allow the most powerful transformations, as in the toy sketch below.
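As a toy illustration of the parse-tree approach, here is Python's standard `ast` module operating on Python-as-target (requires Python 3.9+ for `ast.unparse`; this is just the flavor of the idea, not the actual system):

```python
import ast

src = "def f(x):\n    return x**2 + x\n"
tree = ast.parse(src)

class Rename(ast.NodeTransformer):
    """Rename a variable everywhere it occurs - robust on the parse tree,
    fragile as a textual search-and-replace."""
    def visit_Name(self, node):
        if node.id == 'x':
            node.id = 'r'
        return node

    def visit_arg(self, node):
        if node.arg == 'x':
            node.arg = 'r'
        return node

print(ast.unparse(Rename().visit(tree)))
# def f(r):
#     return r ** 2 + r
```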
- Is this related to reproducible research in any way?
Packages for reproducible research seem to be focused on capturing the parameters that are input to a scientific code, so that the results can be repeated and revisited later.
Executing a program may be reproducible, but the construction of that program is not.
Aside:
Reproducibility can be achieved through documentation or automation. The previous standard for scientific work depended primarily on documentation. With computers, automation becomes possible as a working method. (For space reasons, this analysis is very simplistic --- there is more to it.)
This work is intended to make the construction of a code automated. One advantage shows up when correcting a mistake in the derivation: fix it, push a button, and the final code is automatically updated.
This should be complementary to other systems for reproducible research.
- Why are you doing this?
I'm tired of tracking down missing minus signs and misplaced factors of 2 in my code. A significant feature of this approach is that the entire chain of derivation and generation of the program is performed by machine, to eliminate transcription errors. (If this sounds too rosy, rest assured there are plenty of other types of errors to make).
Saturday, April 23, 2011
Example of modeling derivations
This post will provide an example using the configuration integral for the canonical ensemble in statistical mechanics. I want to evaluate the integral using various methods, starting with grid-based integration methods. A grid-based method suffers from the curse of dimensionality: the time to compute grows exponentially as particles are added (which is why Monte Carlo methods are normally used). However, it can still be useful to check Monte Carlo codes against other integration methods for small particle numbers. Also, the free energy (and all associated quantities) is computed directly, whereas it is more difficult to compute with MC methods. Finally, I'm curious how large a system could be computed on today's hardware.
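For reference, the integral in question has the standard form (with β = 1/(k_B T) and U the total interaction potential energy of the N particles):

```latex
Z_{\mathrm{config}} = \int \cdots \int e^{-\beta U(\mathbf{r}_1, \ldots, \mathbf{r}_N)} \, d\mathbf{r}_1 \cdots d\mathbf{r}_N
```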
SymPy is the computer algebra system I'm using. The code for this example is available on GitHub in the derivation_modeling branch, in the prototype subdirectory.
The output (MathML and generated python) can be seen here.
The flow is as follows:
- Derive the composite trapezoidal rule, starting from the expression for a single interval (a minimal sketch of this step follows the list)
- Generate code for the composite trapezoidal rule
- Start from the configuration integral (which I've called the partition function in the documents) and transform it by specializing various parameters - 2 particles in 2 spatial dimensions, using the Lennard-Jones interaction potential, etc. - until it becomes completely specified numerically.
- Generate code that evaluates this integral using the generated integration code from step 2
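For step 1, here is the minimal SymPy sketch mentioned above of the single-interval starting point (my own illustrative code, not the prototype's):

```python
import sympy as sp

x, a, b = sp.symbols('x a b')
f = sp.Function('f')

# Approximate f on [a, b] by the secant line through (a, f(a)) and (b, f(b)),
# then integrate the line exactly - the single-interval trapezoidal rule.
line = f(a) + (f(b) - f(a)) / (b - a) * (x - a)
rule = sp.simplify(sp.integrate(line, (x, a, b)))
print(rule)   # equivalent to (b - a)*(f(a) + f(b))/2
```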
There's still a tremendous amount of work to do on this, but hopefully an outline of this style of programming is visible from the example.
- The MathML pages have a lot of detailed small steps - it would be nice to have a collapsible hierarchy so all the little steps can be hidden.
- Some abstractions should remain during the steps - in particular the potential. The expression for the integrand can get quite messy from explicitly inserting the expression for the potential (and this is a simple case - more complex potentials would make it unreadable).
- More backends for the code generation - C/C++ is the obvious choice in pursuit of higher performance. (The generated python can be run through Shed Skin, for another route to C++). Converting to Theano expressions might be a quick way to try it out on the GPU.
Thursday, April 07, 2011
Modeling derivations
A derivation starts with some base equation, then proceeds through a series of steps transforming it into a form that can be solved numerically (a symbolic/analytic solution would be nice, but is not possible in most cases). The steps can be divided into three categories (a short sketch of each follows the list):
- Exact transformations - rearranging terms, replacing an expression with a different form, the usual steps involved in a proof, etc.
- Approximations - using a truncated series expansion, replacing a derivative with a discretized version, etc.
- Specializations - Specifying the physical or model parameters in order to make the equation numerically solvable - like the number of spatial dimensions, number of particles, interaction potentials, etc.
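A minimal SymPy sketch of each category (illustrative expressions of my own, not from an actual derivation):

```python
import sympy as sp

x, n = sp.symbols('x n')

# Exact transformation: expand a product; the value is unchanged
e1 = sp.expand((x + 1)**2)                    # -> x**2 + 2*x + 1

# Approximation: truncate a series expansion
e2 = sp.series(sp.exp(x), x, 0, 3).removeO()  # -> x**2/2 + x + 1

# Specialization: fix a model parameter to a concrete value
e3 = (x**n).subs(n, 2)                        # -> x**2

print(e1, e2, e3, sep='\n')
```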
At the end of the derivation, the result, if simple enough, can be evaluated numerically using the facilities of the CAS (Computer Algebra System - I assume this derivation modeling has been implemented inside one). More complex or time-consuming results (the more likely case) can be further transformed by code generation to C/Fortran (or some other execution system).
One important feature of this workflow is that every step in creating the program is checked by machine. This should greatly reduce transcription errors (dropped minus signs and factors of 2 - this is what started the whole project in the first place). It also moves part of the process of creating scientific programs into the realm of 'ordinary software', and as such makes it subject to normal software engineering techniques - unit testing, etc. Testing scientific (numerical) software is a difficult problem; this should help make it (a little) easier and more reliable.
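For instance, a generated integration routine can be checked against the CAS's own answer in an ordinary unit test. A hypothetical sketch, where `trapezoid` stands in for generated code:

```python
import math
import sympy as sp

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule (stand-in for a generated routine)."""
    h = (b - a) / n
    return h * (0.5*f(a) + sum(f(a + i*h) for i in range(1, n)) + 0.5*f(b))

def test_trapezoid_matches_cas():
    # SymPy provides the exact answer; the generated code must agree
    x = sp.Symbol('x')
    exact = float(sp.integrate(sp.sin(x), (x, 0, 1)))
    approx = trapezoid(math.sin, 0.0, 1.0, 1000)
    assert abs(exact - approx) < 1e-5
```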
Some other work I've run across related to describing mathematics:
- Structured derivations were developed for teaching proofs in high-school mathematics.
- OMDoc is a markup language for describing mathematical structures at a higher level than a single formula (a proof, for instance).